An efficient ANFIS-EEBAT approach to estimate effort of Scrum projects

Software effort estimation is a significant part of software development and project management. The accuracy of effort estimation and scheduling results determines whether a project succeeds or fails. Many studies have focused on improving the accuracy of predicted results, yet accurate estimation of effort has proven to be a challenging task for researchers and practitioners, particularly when it comes to projects that use agile approaches. This work investigates the application of the adaptive neuro-fuzzy inference system (ANFIS) along with the novel Energy-Efficient BAT (EEBAT) technique for effort prediction in the Scrum environment. The proposed ANFIS-EEBAT approach is evaluated using real agile datasets. It provides the best results in all the evaluation criteria used. The proposed approach is also statistically validated using nonparametric tests, and it is found that ANFIS-EEBAT worked best as compared to various state-of-the-art meta-heuristic and machine learning (ML) algorithms such as fireworks, ant lion optimizer (ALO), bat, particle swarm optimization (PSO), and genetic algorithm (GA).

estimation like empirical, Delphi-Cost, etc., are not well suited for estimation in the latter 5. Researchers have used machine learning techniques to bridge the gap between actual and estimated effort in agile-inspired software and have recorded a significant improvement. Agile aims to respond to change positively, so soft computing techniques do justice to these inherent characteristics and provide reliable estimates. Among the wide variety of ML techniques, neuro-fuzzy frameworks 6 help establish complex relationships between various people, process, and project attributes. The uncertainty of requirements and the scarcity of historical data make training difficult and predictions vague. ML techniques have been used in conjunction with a wide variety of optimizations 7 such as quality weighting in analogy-based estimation, attribute weighting, tuning of Artificial Neural Network (ANN) weights and biases, ANFIS adjustment, and variable positioning. Many prominent authors have reviewed and compared various regression-based and empirical techniques and found that the former outperform the latter by a significant margin. The study is not limited to a few factors affecting estimation; instead, an exponential expansion can be seen as the complexity of software projects increases. In some scenarios, conflicting outcomes have also been recorded in the literature irrespective of the underlying process models. The estimation accuracy changes with different datasets 8 and/or scenarios 9 for the same machine learning model. Authors report conflicting findings with respect to comparisons of regression and other machine learning models. A comparative analysis of ANN and Case-Based Reasoning (CBR) in Ref. 8 found that ANN outperforms CBR, whereas Ref. 10 reported the contrary outcome.
In agile state-of-the-art reports and the majority of literary resources, IT stakeholders carry out story point estimation using Analogy, Planning Poker (PP), Expert Judgment (EJ), etc. Very few ML techniques have been applied in the field of agile estimation, although this is where they are needed the most because of requirements volatility. They have been applied either alone or blended with other machine learning or non-machine learning methods [11][12][13]. GA has been used with CBR, ANN, and Support Vector Regressor (SVR) for hyperparameter tuning. Fuzzy logic 8, Decision Trees (DT), and Bayesian Networks (BN) have been attempted and reviewed in the field of effort and cost estimation 10,14,15. In a recent study of agile effort estimation, the Deep Belief Network-Ant Lion Optimizer (DBN-ALO) 16 hybrid approach outperformed DT and Random Forest (RF), but it is expensive to train because of its complex data models. The authors of Ref. 17 created an ensemble of Analogy and Artificial Bee Colony for software development effort estimation. Ensembles are evaluated in Ref. 18 and found to outperform solo techniques. A hybrid system based on the Firefly algorithm for predicting maintainability, with emphasis on change and quality management, is used in Ref. 19. Based on the trends and observations recorded by researchers, the literature on ML-assisted estimation is presented in Table 1.
All the techniques mentioned and discussed in this section are derived from general estimation approaches to demonstrate a trail of estimation trends.
Energy Efficient BAT (EEBAT) approach. The underlying architecture of the proposed approach is inspired by the universal estimator, i.e., the Adaptive Neuro-Fuzzy Inference System 30. ANFIS, in its original form, has proved to be a valuable and promising solution for problems of heavyweight process models in the context of software estimation. ANFIS has some inherent pros and cons, which make it a little less efficient for estimating in an agile environment if applied as a standard. Shortcomings of ANFIS include high computational cost due to its complex structure and gradient learning (so it is slow for large inputs), the choice of type and number of membership functions, the location of a membership function, the curse of dimensionality, and the trade-off between interpretability (i.e., rules) and accuracy. As Agility is injecting 'change', a de-facto ingredient in the reshaping the culture of
Standard ANFIS architecture. The Adaptive Neuro-Fuzzy Inference System, popularly known as a universal estimator and Takagi-Sugeno fuzzy system, makes use of the potential of both neural networks and fuzzy logic in one package and is computationally more efficient than Mamdani, which mostly depends on expert knowledge. The architecture of a standard ANFIS is given in Fig. 1. It has five layers of perceptrons (neurons), in which the neurons in the same layer are alike and have similar functionalities, as follows:
• Fuzzifying layer: Each neuron is an adaptive node consisting of premise parameters.
• Implication layer: Neurons indicate the product of inputs.
• Normalizing layer: Each neuron is fixed.
• Defuzzifying layer: Each neuron is also an adaptive one consisting of consequence parameters.
• Combining layer: It contains a single neuron that adds up all the inputs.
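The five layers listed above can be illustrated with a minimal forward-pass sketch of a first-order Takagi-Sugeno system with Gaussian membership functions. The function names, the one-rule-per-MF-set layout and the parameter containers (`premise`, `consequent`) are assumptions made for illustration; the sketch is not the MATLAB implementation used in this work.

```python
import math

def gaussmf(x, sigma, c):
    """Gaussian membership grade of x for a fuzzy set centred at c."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def anfis_forward(inputs, premise, consequent):
    """One forward pass through a first-order Takagi-Sugeno ANFIS.

    inputs     : crisp input values, one per input variable
    premise    : premise[r][i] = (sigma, c) of rule r's Gaussian MF on input i
    consequent : consequent[r] = [p_1, ..., p_n, bias] of rule r's linear output
    """
    # Layer 1 (fuzzifying): membership grade of each input under each rule
    grades = [[gaussmf(x, s, c) for x, (s, c) in zip(inputs, rule)]
              for rule in premise]
    # Layer 2 (implication): firing strength is the product of the grades
    strengths = [math.prod(g) for g in grades]
    # Layer 3 (normalizing): divide each strength by the total strength
    total = sum(strengths)
    normalized = [w / total for w in strengths]
    # Layer 4 (defuzzifying): weight each rule's linear consequent
    rule_outputs = [
        w * (sum(p * x for p, x in zip(params[:-1], inputs)) + params[-1])
        for w, params in zip(normalized, consequent)
    ]
    # Layer 5 (combining): a single node sums all weighted rule outputs
    return sum(rule_outputs)
```

With two rules that share the same premise parameters, each rule receives a normalized weight of 0.5, so the output is simply the average of the two linear consequents. The premise tuples feed layer 1 and the consequent lists feed layer 4, which are the two adaptive layers named in the list.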
Energy efficient BAT approach. The standard BAT algorithm 31 has certain inherent issues such as failure to converge to global optima, difficulty with multimodal optimization, poor exploration, slow rate of convergence, and no population diversity. To address these issues, researchers across the globe have introduced various BAT variants like Adaptive multi-swarm bat algorithm (AMBA) 32 42 endorsed the use of optimization techniques to reduce and determine effort of software projects. The inferences deduced from these variants are: handling the trade-off between exploration and exploitation; converging to global optima instead of being trapped in local minima; flexibility in integrating the bat variants into different models; a diversity factor to maintain the distinctness of the population; and improving the algorithm for multimodal functions.
In our proposed algorithm, we update the standard bat algorithm with a new parameter called Energy, which updates the position and velocity of the bat based on its distance from the prey. We propose two new factors for the energy parameter, eagerness and magnitude of work, that are dynamically updated to control the exploration and exploitation trade-off. It becomes exhausting for a bat or pair of bats to search for its target or prey due to continuous echolocation (lack of cognitive ability), exploration (failure to converge), and exploitation (trapping in local optima). To address these concerns, EEBAT is proposed. The distinctive features of the proposed algorithm are the energy parameter and memory capability. The energy parameter, E, can be calculated using Eq. (1), where fitness_i is the fitness of the current bat. The population diversity due to energy lets the bat intelligently assess its capability, thus improving time complexity and convergence. The mean of the best positions is taken to find a convergence junction, as every bat in the population finds a different position for one value of the parameter. These positions are the best solutions, as is evident from the calculated fitness value, so the collective energy of these deduced positions determines their optimality.
Table 2). ANFIS based exhaustive search MATLAB functionality has been used to decide the inputs. Different inputs (as feature pairs) are tested against Actual Time. The feature pair with the minimum error is chosen in the input layer.
Memory capability of the bat: the population in the standard bat algorithm has no history of the solutions encountered by previous bats; hence, novel solutions are missed and premature convergence occurs. To close this gap of the standard bat algorithm, the second improvement proposed is the introduction of memory capability. After every iteration, we store the positions of the bats in a special space called Memory Space (MS).
This capability improves exploration as previously encountered solutions are prevented from being explored and exploited, hence improving the rate of convergence. This prevents the population from being trapped in local optima. It improves the time complexity of the algorithm.
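The memory capability described above can be sketched on top of a standard bat loop (frequency, velocity, loudness and pulse-rate updates). This is only an illustration under several assumptions: the energy parameter is deliberately left out because its form depends on Eq. (1); the Memory Space is modelled as a set of positions rounded to three decimals; the pulse rate follows one global schedule; and `fmin`, `fmax`, `alpha` and `gamma` defaults are placeholder values, not the tuned values of this work.

```python
import math
import random

def bat_with_memory(fitness, dim, bounds, n_bats=40, n_iter=100,
                    fmin=0.0, fmax=2.0, alpha=0.9, gamma=0.9, seed=0):
    """Minimise `fitness` with a standard bat loop plus a visited-position memory."""
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(v):
        return max(lo, min(hi, v))

    def mem_key(p):
        # Hypothetical memory key: a position rounded to 3 decimals
        return tuple(round(v, 3) for v in p)

    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [fitness(p) for p in pos]
    loud = [1.0] * n_bats
    memory = {mem_key(p) for p in pos}        # Memory Space (MS)
    best_i = min(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[best_i][:], fit[best_i]

    for t in range(1, n_iter + 1):
        pulse = 1.0 - math.exp(-gamma * t)    # pulse rate grows towards 1
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Global move: frequency-scaled velocity update relative to the best
            freq = fmin + (fmax - fmin) * rng.random()
            vel[i] = [v + (x - b) * freq
                      for v, x, b in zip(vel[i], pos[i], best)]
            cand = [clip(x + v) for x, v in zip(pos[i], vel[i])]
            if rng.random() > pulse:
                # Local move: small random walk around the current best
                cand = [clip(b + rng.uniform(-1.0, 1.0) * mean_loud)
                        for b in best]
            k = mem_key(cand)
            if k in memory:
                continue                      # already visited: skip it
            memory.add(k)
            cand_fit = fitness(cand)
            if cand_fit <= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha              # loudness decays on acceptance
            if cand_fit < best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit
```

A call such as `bat_with_memory(lambda p: sum(v * v for v in p), dim=2, bounds=(-5.0, 5.0))` minimises a simple sphere function. Skipping candidates that are already in the memory set is what saves repeated fitness evaluations in this sketch; the rounding precision controls how coarse the notion of "previously encountered" is.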
Scrum effort estimation using ANFIS-EEBAT approach. ANFIS provides increased learning, adaptive, and non-linear abilities, as it combines the advantages of neural and fuzzy inference systems and can thereby be trained without an explicit empirical knowledge pool. Despite its strong estimation capabilities, the ANFIS architecture needs parameter adjustment and tuning. The objective of the ANFIS-EEBAT approach is to optimize the parameters of ANFIS using an energy-efficient BAT algorithm. To begin with, the system needs input data before it can start estimating the effort of new projects. Our approach depends on training with certain project parameters, which are first inserted into the knowledge base. However, the data needs to be understandable, so before training it is passed through the data preparation module. This section discusses our proposed ANFIS-EEBAT algorithm in the context of effort estimation.

Methodology
In this section, we consider agile project data from six software houses, listed in Table 2, as sample inputs. The algorithm of the proposed methodology is presented in four broad categories given below.
Dataset loading and feature selection. The dataset has been taken initially from six software houses which implemented agile-based projects and the following steps have been employed.
• Loading the agile project dataset.
• Performing feature selection using an exhaustive search based on ANFIS.
Data set partitioning and model selection. The transformed data will be split into training and testing sets.
• Partitioning of transformed data into training and testing sets in the ratio 80:20.
• Training the ANFIS-EEBAT model using the training data.
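The 80:20 partitioning step can be sketched as follows. Shuffling before the cut and the fixed seed are assumptions made for reproducibility of the illustration; the text does not state how the rows were ordered before splitting.

```python
import random

def split_80_20(rows, seed=42, train_ratio=0.8):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    rng = random.Random(seed)
    shuffled = rows[:]                 # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

For a dataset of 162 projects this yields 130 training rows and 32 testing rows.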
Testing part. In this part, model prediction on test data has been performed.
• Performing prediction using the trained model.
• Comparing prediction results with the original dataset.
• Mean absolute percentage error (MAPE): It determines the absolute accuracy of different estimation models. The term absolute refers to the assessment of the estimated costs against the actual recognized costs. MAPE can be calculated using Eq. (3). In this, the summation is taken over each estimated point and divided by the number of suitable points N.
• Prediction (PRED(x)): PRED(x) is mathematically defined by Eq. (4), and its value is calculated using Eq. (5). Here, 'N' represents the total number of projects and 'K' is the count of projects having MRE below or equal to x. The value of x can be 0.25, 0.50, 0.75 or 1.0. For example, if x is 0.50, then PRED(0.50) refers to the percentage of projects whose MRE is less than or equal to 50%. Measuring the accuracy of estimation in Scrum is an essential activity and determines how the approach compares with itself and with others.
• Performing model comparison using various performance metrics.
• Comparing the output of the above-defined metrics.
Deducing optimal parameters from EEBAT. The proposed system, after the default initialization process, undergoes tuning of the base fuzzy system parameters by EEBAT. The inherent training algorithm of ANFIS is replaced by EEBAT. The parameters of the base fuzzy system are adjusted based on the fitness/error function, the Mean Magnitude of Relative Error (MMRE), which should be low, as given in Eq. (6).
Here, N is the number of projects in the dataset. genfis is used as the base fuzzy system with fuzzy c-means clustering to create rules and input MFs in the forward pass. EEBAT minimizes the error in the backward pass. The detailed stages of effort estimation are given in Fig. 2.
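The error measures named in this section (MRE, MMRE, MAPE and PRED(x)) can be computed as in the short sketch below. It assumes the commonly used definitions, with MRE taken relative to the actual effort and MAPE expressed as MMRE in percent; the function names are illustrative.

```python
def mre(actual, predicted):
    """Magnitude of relative error of a single project."""
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    """Mean magnitude of relative error over N projects."""
    return sum(mre(a, p) for a, p in zip(actuals, predictions)) / len(actuals)

def mape(actuals, predictions):
    """Mean absolute percentage error, i.e. MMRE expressed in percent."""
    return 100.0 * mmre(actuals, predictions)

def pred(actuals, predictions, x=0.25):
    """Percentage of projects (K out of N) whose MRE is at most x."""
    k = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= x)
    return 100.0 * k / len(actuals)
```

Because a low MMRE is the optimization target, a function such as `mmre` is the natural fitness to hand to the optimizer, while `pred` is reported afterwards for a chosen threshold x.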
Employing optimized parameters in ANFIS obtained by EEBAT. In this step, values of error metrics, e.g. MMRE will be observed. The optimized parameters obtained in the previous section will be initialized as default parameters of MFs of base fuzzy system.

Experimental results and discussion
The accuracy achieved by the system depicts the efficacy of the proposed system. Many researchers have presented their hybrid approaches by incorporating meta-heuristic algorithms for parameter(s) optimization.
Table 2. The dataset has been taken from Zia 21.

Renaming, identification and selection of features and labels. We have renamed a few fields of the dataset and performed an ANFIS-based exhaustive search to find the best combination of fields, which is chosen as inputs (features) and matched against the output (label). This exhaustive search has been carried out in MATLAB. The fields named "Effort", "V" and "Actual Time" in Table 2 are renamed to "No. of Story Points", "Velocity" and "Actual Effort", respectively. Table 3 shows that our label "Actual Effort" is mostly affected by "No. of Story Points" and "Velocity", with the minimum train error of 0.6504. The other pairs (No. of Story Points − Team Size) and (Velocity − Team Size) have not been selected, as their train error is higher than that of the chosen pair. This section assists IT managers in making better feature selection decisions.
Selecting only the indispensable features minimizes complexity and produces software effort estimation results in less time 43.
The deduced features and label after renaming are given in Table 4.
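The idea of an exhaustive search over feature pairs can be sketched as below. Note the substitution: a plain least-squares linear model stands in for ANFIS training so that the example stays self-contained, which means the error values it produces are not comparable with those in Table 3. All function names are hypothetical.

```python
from itertools import combinations
import math

def fit_linear(X, y):
    """Least-squares fit of y on X plus a bias term (stand-in for ANFIS training)."""
    rows = [list(r) + [1.0] for r in X]
    n = len(rows[0])
    # Augmented normal equations [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def train_rmse(X, y):
    """Training RMSE of the stand-in model on the given feature columns."""
    w = fit_linear(X, y)
    preds = [sum(wi * xi for wi, xi in zip(w, list(r) + [1.0])) for r in X]
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y))

def best_feature_pair(columns, target):
    """Try every pair of candidate columns and keep the lowest training error."""
    scores = {}
    for a, b in combinations(columns, 2):
        X = list(zip(columns[a], columns[b]))
        scores[(a, b)] = train_rmse(X, target)
    return min(scores, key=scores.get), scores
```

Here `columns` is a dictionary mapping a field name to its list of values. With three candidate fields there are only three pairs to try, so an exhaustive search is cheap; the cost grows combinatorially with the number of fields.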
Expansion of dataset using k-means SMOTE. We have applied the k-means based Synthetic Minority Over-sampling Technique (SMOTE), a data augmentation technique, to the Zia dataset using Eq. (7) to expand it. Here, x is an element of the minority class set A, and its paired element belongs to a set A1, which is calculated using the k nearest neighbors of x, sampled at some rate N. The new dataset is labeled ZKmS (Zia K-means SMOTE) and is used in our ANFIS-EEBAT model.

Descriptive statistics of the dataset. The descriptive statistics of ZKmS are given in Table 5. It includes the count (number of projects in the dataset), mean, standard deviation, minimum and maximum value of "No. of Story Points", "Velocity" and "Actual Effort" in the dataset.
The statistic "Count" with value 162 signifies that ZKmS contains data for 162 projects. "Mean" represents the average value of each field. "Std" is the standard deviation, which represents the spread of the field values around the mean. "Min" and "Max" show the minimum and maximum values, respectively.
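The interpolation idea behind SMOTE, which underlies the ZKmS expansion, can be sketched as follows. The sketch covers only the generic step of creating a synthetic point between a sample and one of its k nearest neighbours; the preceding k-means clustering and the exact form of Eq. (7) are not reproduced, and the function name and defaults are assumptions.

```python
import math
import random

def smote_samples(points, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating towards nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(points)
        # k nearest neighbours of x among the other points (Euclidean distance)
        others = [p for p in points if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(p, x))[:k]
        x_nn = rng.choice(neighbours)
        gap = rng.random()             # uniform in [0, 1)
        # The new point lies on the segment between x and its neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, x_nn)))
    return synthetic
```

Each synthetic project lies between two existing ones, so the generated values stay within the range already spanned by the original data.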
Model selection. ANFIS-EEBAT has been applied to the features from the dataset as per the steps given below.
Data loading and generate fuzzy inference system. After we input features in the proposed ANFIS-EEBAT model, the antecedent layer creates the input MFs. The initial set of parameters for ANFIS and EEBAT are given in Table 6. The values of ANFIS parameters have been optimized using EEBAT.
Building ANFIS-EEBAT model structure. After setting up the initial parameters, the proposed model's structure is shown in Fig. 3.
The ANFIS and EEBAT parameters are explained in Table 6. The number of inputs is "2", namely "No. of Story Points" and "Velocity". The number of outputs is "1", namely "Actual Effort". The learning algorithm is "EEBAT". The value "4" for the number of input MFs signifies that there are 4 Gaussian MFs for each input, each with a unique set of Gaussian parameters. The "Fuzzy C-Means" partitioning method is employed to create the base fuzzy inference system. The input MF is "gaussmf (Gaussian)", which represents our data as a normal distribution, and the output MF is "linear", which produces a singular value. The base fuzzy system is created using the "genfis3" functionality of MATLAB. The "And" method signifies the product of the weights of the neuro-fuzzy system with the inputs. The "Or" method utilizes "probor (probabilistic or)", which is the algebraic sum of the previous layers. The implication and aggregation are set to "min" and "max", respectively. "wtaver", i.e., weighted average, is used for defuzzification. The training iterations (epochs) are set to 100, as overfitting occurs beyond this value. The iterations have been validated against several trials. The error tolerance is set to 1e−5. The initial BAT population is set to "40". The maximum number of iterations is "100". Pulse rate signifies the precision with which the algorithm searches for the optimal solution. The tuning parameters of ANFIS are the optimal solution. Loudness controls the speed of convergence of the algorithm. The values of fmin and fmax determine the range of frequency, which assists the global searching capability. Alpha and gamma are constants. The values for each parameter are obtained through several exhaustive trials.
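For reference, the operators named above have the following standard forms: the Gaussian membership function with width σ and centre c, the probabilistic OR of two grades a and b, and the weighted-average defuzzification over rules r. The symbols w_r (firing strength of rule r) and z_r (output of rule r) are notation introduced here for illustration.

```latex
\mu_{A}(x) = \exp\!\left(-\frac{(x - c)^{2}}{2\sigma^{2}}\right), \qquad
\operatorname{probor}(a, b) = a + b - ab, \qquad
y = \frac{\sum_{r} w_{r}\, z_{r}}{\sum_{r} w_{r}}
```

With four Gaussian MFs per input, each input therefore contributes four (σ, c) pairs to the set of parameters that the optimizer can tune.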

ANFIS-EEBAT MFs and rules view.
After training and testing, the membership function parameters adjusted using EEBAT can be seen in Fig. 4a,b. The rules for the same are shown in Fig. 5. Fig. 6 depicts the mapping of the features to the labels. It can be deduced from the surface plot that, for our features, the output is linear, which follows the Takagi-Sugeno type 3 Fuzzy Inference System (FIS).

ANFIS-EEBAT performance evaluation. The ANFIS-EEBAT model's performance has been evaluated using various metrics such as the Squared Correlation Coefficient (R²), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), MMRE, and PRED, and is given in Table 7 for the ZKmS and Zia datasets. ANFIS-EEBAT has also been compared with other state-of-the-art models on the aforementioned datasets, and the results are summarized in Tables 8 and 9. Our approach is accurate to 98.47% and 99.93% on the ZKmS and Zia datasets, respectively, and will assist IT industry stakeholders in getting accurate estimates of their respective projects. It also provides 100% estimation accuracy up to 2.4% for PRED. The R-square value for ANFIS-EEBAT is very high, as can be seen in Table 7 (0.98472 for ZKmS and 0.99934 for Zia).
Figure 3. ANFIS-EEBAT structure. It contains five layers, as discussed in Fig. 1. There are two inputs, four pairs of input MFs, four rules, four output MFs and one output. The two inputs are "No. of Story Points" and "Velocity". The output is "Estimated Effort". The operations performed at the different layers match the description in Fig. 1. There are three basic logical operations, "and", "or" and "not", depicted in the figure with the color codes "blue", "red" and "green", respectively. The rules are created using the logical "and" operation in our case. The logical "or" and "not" operations are unused.

Statistical validations
Because the datasets in software effort estimation studies do not fit any particular distribution, non-parametric tests are advised 44. Given the nature of our data, non-parametric tests such as Friedman 45 have been applied to the ZKmS dataset using SPSS. The average ranking of the models using this test is shown in Fig. 8. The test assigns the lowest rank to the best technique.
Table 9. Results on Zia with other techniques. *The data of performance metrics is not available in the referred research papers.
This comparison has been performed on the real agile projects. It can be inferred that, in spite of the good accuracies of the Fireworks-optimized neural network and the Deep Belief Network-Ant Lion Optimizer (DBN-ALO), a gap between actual and estimated effort is still present. This gap has been further narrowed using the ANFIS-EEBAT approach, with a PRED of 100 close to 2.4%. Figure 7 depicts the box plot of ANFIS-EEBAT with other models.
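The average ranking used by the Friedman test can be reproduced outside SPSS with a few lines of code. The sketch below assumes that each block is one project (or one evaluation run) with one error value per model, ranks the errors in ascending order so that the best model receives the lowest rank, and computes the uncorrected Friedman statistic without the tie correction that statistical packages usually apply.

```python
def average_ranks(row):
    """Rank values in ascending order (1 = smallest), averaging tied ranks."""
    order = sorted(range(len(row)), key=row.__getitem__)
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0      # mean of the tied rank positions
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def friedman(errors):
    """errors[b][m] is the error of model m on block b; returns (mean ranks, chi2)."""
    n, k = len(errors), len(errors[0])
    ranked = [average_ranks(row) for row in errors]
    mean_ranks = [sum(r[m] for r in ranked) / n for m in range(k)]
    chi2 = (12.0 * n / (k * (k + 1))
            * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0))
    return mean_ranks, chi2
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom, where k is the number of models being compared.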

Threat to validity
The dataset has been generated using SMOTE (k-means) from the original agile data taken from six software houses and the proposed algorithm has been applied on both Zia and ZKmS to validate its efficiency. However, it can be validated on more datasets.

Concluding remarks
Estimation is an indispensable requisite that assists project managers in taking firm decisions and fulfilling client commitments. As per the current literature, at the start of any typical IT project, managers primarily depend upon empirical estimation. Due to the complex nature of projects, estimation based on an educated guess does not yield fruitful results. Machine learning assisted estimation narrows the gap between actual and estimated effort to a substantial level. We have attempted to bridge the aforementioned gap to a greater extent using the ANFIS-EEBAT approach. Our approach makes use of three capabilities, viz. neural networks, fuzzy logic, and the novel BAT. The complexity of the proposed algorithm is managed by our novel energy equation and memory space concept. This work can be extended using other optimization algorithms such as firefly and Sail Fish Optimizer.

Data availability
The data shall be made available on request.